Continuous Q-Learning in Markov Decision Processes

Under Review

Authors

Kenny Guo, Ricardo Parada, Helen Yuan, Larissa Xu, Valentio Iverson, Sahan Wijetunga, William Chang

Published

October 25, 2025

Categories

reinforcement learning

Abstract. Reliable simulation agents have become crucial for evaluating and deploying autonomous systems, with recent work showing that large-scale self-play can produce agents that generalize robustly and minimize undesirable behaviors (Cornelisse et al., 2025). However, most reinforcement learning methods still rely on restrictive tabular MDP assumptions, limiting their practical impact in real-world applications with continuous, structured state spaces. In this work, we propose novel frameworks for continuous-state and Poisson-time Markov Decision Processes (MDPs). In continuous-state MDPs, the state space is modeled as a compact metric manifold. In Poisson-time MDPs, the horizon is the continuous interval \([0, \infty)\), and the learner makes \(H\) decisions at times sampled from a homogeneous Poisson process. In analogy to the classical Lipschitz bandit setting, we assume that the MDP parameters, namely the transition dynamics and the reward function, are Lipschitz continuous with respect to the underlying metric. Using discretization techniques from this literature, we extend the Q-learning algorithm of Jin et al. (2018) to the continuous-state and continuous-time settings and establish nearly optimal regret bounds.
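To make the two ingredients described above concrete, the following is a minimal illustrative sketch, not the paper's algorithm: it discretizes a toy compact state space \([0, 1]\) into an \(\varepsilon\)-net, draws the \(H\) decision epochs of each episode as the first \(H\) arrivals of a homogeneous Poisson process, and updates Q-values with an optimistic Hoeffding-style rule in the spirit of Jin et al. (2018). The dynamics, reward, rate, and all constants are placeholder assumptions chosen only for illustration.

```python
import numpy as np

# --- Hypothetical parameters (illustrative only, not from the paper) ---
H = 5            # number of decisions per episode
K = 200          # number of episodes
eps = 0.1        # discretization scale of the epsilon-net
n_actions = 3
rate = 1.0       # rate of the homogeneous Poisson process on [0, inf)

rng = np.random.default_rng(0)

# Discretize the compact state space [0, 1] into an epsilon-net of centers.
centers = np.arange(eps / 2, 1.0, eps)
n_states = len(centers)

def nearest_center(x):
    """Map a continuous state to the index of its nearest net point."""
    return int(np.clip(np.round((x - eps / 2) / eps), 0, n_states - 1))

def step(x, a, dt):
    """Toy Lipschitz dynamics and reward (stand-ins for the true MDP)."""
    drift = (a - 1) * 0.1 * dt            # action pushes the state up or down
    x_next = np.clip(x + drift + 0.05 * rng.normal(), 0.0, 1.0)
    reward = 1.0 - abs(x_next - 0.5)      # reward is Lipschitz in the state
    return x_next, reward

# Optimistic Q-learning on the net, in the style of Jin et al. (2018).
Q = np.full((H, n_states, n_actions), float(H))   # optimistic initialization
N = np.zeros((H, n_states, n_actions), dtype=int)
c = 1.0                                           # exploration bonus constant

for k in range(K):
    # Decision times: first H arrivals of a homogeneous Poisson process.
    arrival_times = np.cumsum(rng.exponential(1.0 / rate, size=H))
    x = rng.uniform()                             # initial continuous state
    prev_t = 0.0
    for h in range(H):
        s = nearest_center(x)
        a = int(np.argmax(Q[h, s]))
        dt = arrival_times[h] - prev_t
        x_next, r = step(x, a, dt)
        s_next = nearest_center(x_next)

        N[h, s, a] += 1
        t = N[h, s, a]
        alpha = (H + 1) / (H + t)                 # Jin et al. learning rate
        bonus = c * np.sqrt(H**3 * np.log(K) / t) # Hoeffding-style UCB bonus
        V_next = min(H, Q[h + 1, s_next].max()) if h + 1 < H else 0.0
        Q[h, s, a] = (1 - alpha) * Q[h, s, a] + alpha * (r + V_next + bonus)

        x, prev_t = x_next, arrival_times[h]
```

The sketch keeps the two continuous structures separate from the learning rule: the net handles the Lipschitz state space, the Poisson arrivals handle continuous time, and the tabular update is unchanged from the discrete setting, which is the discretization idea the abstract refers to.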